Mobile manipulation



UMI-on-Air: Embodiment-Aware Guidance for Embodiment-Agnostic Visuomotor Policies

Gupta, Harsh, Guo, Xiaofeng, Ha, Huy, Pan, Chuer, Cao, Muqing, Lee, Dongjae, Scherer, Sebastian, Song, Shuran, Shi, Guanya

arXiv.org Artificial Intelligence

We introduce UMI-on-Air, a framework for embodiment-aware deployment of embodiment-agnostic manipulation policies. Our approach leverages diverse, unconstrained human demonstrations collected with a handheld gripper (UMI) to train generalizable visuomotor policies. A central challenge in transferring these policies to constrained robotic embodiments, such as aerial manipulators, is the mismatch in control and robot dynamics, which often leads to out-of-distribution behaviors and poor execution. To address this, we propose Embodiment-Aware Diffusion Policy (EADP), which couples a high-level UMI policy with a low-level embodiment-specific controller at inference time. By integrating gradient feedback from the controller's tracking cost into the diffusion sampling process, our method steers trajectory generation towards dynamically feasible modes tailored to the deployment embodiment. This enables plug-and-play, embodiment-aware trajectory adaptation at test time. We validate our approach on multiple long-horizon and high-precision aerial manipulation tasks, showing improved success rates, efficiency, and robustness under disturbances compared to unguided diffusion baselines. Finally, we demonstrate deployment in previously unseen environments, using UMI demonstrations collected in the wild, highlighting a practical pathway for scaling generalizable manipulation skills across diverse, and even highly constrained, embodiments. All code, data, and checkpoints will be publicly released after acceptance. Result videos can be found at umi-on-air.github.io.
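
The abstract's key mechanism, steering diffusion sampling with the gradient of a controller's tracking cost, resembles classifier guidance. Below is a minimal sketch of that idea; the `denoiser` interface, the acceleration-penalty `tracking_cost`, and the guidance scale are illustrative assumptions, not the authors' implementation. In EADP the cost would come from the deployment embodiment's low-level controller; the sketch uses a simple smoothness penalty as a stand-in.

```python
import torch

def tracking_cost(traj: torch.Tensor) -> torch.Tensor:
    """Stand-in for an embodiment-specific tracking cost: penalize
    accelerations a constrained platform (e.g. an aerial arm) could not
    track. Must be differentiable in traj."""
    vel = traj[:, 1:] - traj[:, :-1]
    acc = vel[:, 1:] - vel[:, :-1]
    return (acc ** 2).sum()

@torch.no_grad()
def guided_sample(denoiser, steps=50, horizon=16, dim=7, scale=0.1):
    """Reverse diffusion with cost-gradient guidance at every step.
    `denoiser(x, t)` is an assumed interface for one denoising step."""
    traj = torch.randn(1, horizon, dim)            # start from pure noise
    for t in reversed(range(steps)):
        traj = denoiser(traj, t)                   # ordinary denoising step
        with torch.enable_grad():                  # guidance needs gradients
            x = traj.detach().requires_grad_(True)
            grad = torch.autograd.grad(tracking_cost(x), x)[0]
        traj = traj - scale * grad                 # steer toward feasible modes
    return traj

# Toy usage: a "denoiser" that just shrinks noise, standing in for a trained net.
print(guided_sample(lambda x, t: 0.9 * x).shape)   # torch.Size([1, 16, 7])
```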


BINDER: Instantly Adaptive Mobile Manipulation with Open-Vocabulary Commands

Cho, Seongwon, Ahn, Daechul, Shin, Donghyun, Choi, Hyeonbeom, Kim, San, Choi, Jonghyun

arXiv.org Artificial Intelligence

Open-vocabulary mobile manipulation (OVMM) requires robots to follow language instructions, navigate, and manipulate while updating their world representation under dynamic environmental changes. However, most prior approaches update their world representation only at discrete update points such as navigation targets, waypoints, or the end of an action step, leaving robots blind between updates and causing cascading failures: overlooked objects, late error detection, and delayed replanning. To address this limitation, we propose BINDER (Bridging INstant and DEliberative Reasoning), a dual-process framework that decouples strategic planning from continuous environment monitoring. Specifically, BINDER integrates a Deliberative Response Module (DRM, a multimodal LLM for task planning) with an Instant Response Module (IRM, a VideoLLM for continuous monitoring). The two modules play complementary roles: the DRM performs strategic planning with structured 3D scene updates and guides what the IRM attends to, while the IRM analyzes video streams to update memory, correct ongoing actions, and trigger replanning when necessary. Through this bidirectional coordination, the modules address the trade-off between maintaining awareness and avoiding costly updates, enabling robust adaptation under dynamic conditions. Evaluated in three real-world environments with dynamic object placement, BINDER achieves substantially higher success and efficiency than state-of-the-art baselines, demonstrating its effectiveness for real-world deployment.
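
As a rough illustration of the dual-process coordination described above, here is a toy loop in which a slow planner (standing in for the DRM) produces a plan plus attention hints, while a fast monitor (standing in for the IRM) watches incoming frames, updates memory, and triggers replanning when a watched object moves. All classes, method names, and the toy perception are hypothetical, not the paper's API.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    objects: dict = field(default_factory=dict)    # object name -> last seen location

class DeliberativeModule:                          # stand-in for the MLLM planner (DRM)
    def plan(self, instruction, memory):
        plan = ["navigate(kitchen)", "pick(cup)"]  # toy plan
        hints = {"watch_for": ["cup"]}             # guides what the monitor attends to
        return plan, hints

class InstantModule:                               # stand-in for the VideoLLM monitor (IRM)
    def observe(self, frame, hints, memory):
        """Update memory from the frame; return True if replanning is needed."""
        moved = any(obj in frame and memory.objects.get(obj) not in (None, frame[obj])
                    for obj in hints["watch_for"])
        memory.objects.update({k: v for k, v in frame.items() if k in hints["watch_for"]})
        return moved

def run(instruction, frames):
    drm, irm, memory = DeliberativeModule(), InstantModule(), Memory()
    plan, hints = drm.plan(instruction, memory)
    for frame in frames:                           # monitoring continues between steps
        if irm.observe(frame, hints, memory):      # drift detected -> replan early
            plan, hints = drm.plan(instruction, memory)
    return plan

print(run("bring the cup", [{"cup": "table"}, {"cup": "counter"}]))
```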


Generations in Dialogue: Embodied AI, robotics, perception, and action with Professor Roberto Martín-Martín

AIHub

Generations in Dialogue: Bridging Perspectives in AI is a podcast from AAAI featuring thought-provoking discussions between AI experts, practitioners, and enthusiasts from different age groups and backgrounds. Each episode delves into how generational experiences shape views on AI, exploring the challenges, opportunities, and ethical considerations that come with the advancement of this transformative technology. In the third episode of this new series from AAAI, host Ella Lan chats to Professor Roberto Martín-Martín about taking a screwdriver to his toys as a child, how his research focus has evolved over time, how different generations interact with technology, making robots for everyone, being inspired by colleagues, advice for early-career researchers, and how machines can enhance human capabilities. Roberto Martín-Martín is an Assistant Professor of Computer Science at the University of Texas at Austin, where his research integrates robotics, computer vision, and machine learning to build autonomous agents capable of perceiving, learning, and acting in the real world. He previously worked as an AI Researcher at Salesforce AI and as a Postdoctoral Scholar at the Stanford Vision and Learning Lab with Silvio Savarese and Fei-Fei Li, leading projects in visuomotor learning, mobile manipulation, and human-robot interaction.


EchoVLA: Robotic Vision-Language-Action Model with Synergistic Declarative Memory for Mobile Manipulation

Lin, Min, Liang, Xiwen, Lin, Bingqian, Liu, Jingzhi, Jiao, Zijian, Li, Kehan, Ma, Yuhan, Liu, Yuecheng, Zhao, Shen, Zhuang, Yuzheng, Liang, Xiaodan

arXiv.org Artificial Intelligence

Recent progress in Vision-Language-Action (VLA) models has enabled embodied agents to interpret multimodal instructions and perform complex tasks. However, existing VLAs are mostly confined to short-horizon, table-top manipulation, lacking the memory and reasoning capability required for long-horizon mobile manipulation, where agents must coordinate navigation and manipulation under changing spatial contexts. In this work, we present EchoVLA, a memory-aware VLA model for long-horizon mobile manipulation. EchoVLA incorporates a synergistic declarative memory inspired by the human brain, consisting of a scene memory that maintains a collection of spatial-semantic maps and an episodic memory that stores task-level experiences with multimodal contextual features. During both training and inference, the two memories are individually stored, updated, and retrieved based on current observations, task history, and instructions, and their retrieved representations are fused via coarse- and fine-grained attention to guide mobile-arm diffusion policies. To support large-scale training and evaluation, we further introduce MoMani, an automated benchmark that generates expert-level long-horizon trajectories through multimodal large language model (MLLM)-guided planning and feedback-driven refinement, supplemented with real-robot demonstrations. Experiments in simulated and real-world settings show that EchoVLA improves long-horizon performance, reaching a 0.52 success rate (SR) on manipulation/navigation and 0.31 on mobile manipulation, exceeding $\pi_{0.5}$ by +0.08 and +0.11, respectively.
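
A minimal sketch of the retrieve-then-fuse pattern the abstract describes: top-k retrieval from a memory bank followed by cross-attention between the current observation and the retrieved scene and episodic entries. The cosine-similarity retrieval, the single attention layer, and all shapes are simplifying assumptions, not EchoVLA's architecture.

```python
import torch
import torch.nn.functional as F

def retrieve(query, memory_keys, memory_vals, k=4):
    """Cosine-similarity top-k retrieval from a memory bank."""
    sim = F.cosine_similarity(query.unsqueeze(0), memory_keys, dim=-1)
    idx = sim.topk(min(k, len(memory_keys))).indices
    return memory_vals[idx]

def fuse(obs, scene_mem, episodic_mem, attn):
    """Cross-attend the observation over retrieved scene + episodic entries."""
    mem = torch.cat([scene_mem, episodic_mem], dim=0).unsqueeze(0)  # (1, M, D)
    out, _ = attn(obs.unsqueeze(0), mem, mem)                       # query = observation
    return out.squeeze(0)                                           # fused policy context

D = 64
attn = torch.nn.MultiheadAttention(D, num_heads=4, batch_first=True)
obs = torch.randn(1, D)                            # current observation token
keys, vals = torch.randn(100, D), torch.randn(100, D)
fused = fuse(obs, retrieve(obs[0], keys, vals), retrieve(obs[0], keys, vals), attn)
print(fused.shape)                                 # torch.Size([1, 64])
```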


MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation

Li, Chengshu, Xu, Mengdi, Bahety, Arpit, Yin, Hang, Jiang, Yunfan, Huang, Huang, Wong, Josiah, Garlanka, Sujay, Gokmen, Cem, Zhang, Ruohan, Liu, Weiyu, Wu, Jiajun, Martín-Martín, Roberto, Li, Fei-Fei

arXiv.org Artificial Intelligence

Imitation learning from large-scale, diverse human demonstrations has proven effective for training robots, but collecting such data is costly and time-consuming. This challenge is amplified for multi-step bimanual mobile manipulation, where humans must teleoperate both a mobile base and two high-degree-of-freedom arms. Prior automated data generation frameworks have addressed static bimanual manipulation by augmenting a few human demonstrations in simulation, but they fall short for mobile settings due to two key challenges: (1) determining base placement to ensure reachability, and (2) positioning the camera to provide sufficient visibility for visuomotor policies. To address these issues, we introduce MoMaGen, which formulates data generation as a constrained optimization problem that enforces hard constraints (e.g., reachability) while balancing soft constraints (e.g., visibility during navigation). This formulation generalizes prior approaches and provides a principled foundation for future methods. We evaluate MoMaGen on four multi-step bimanual mobile manipulation tasks and show that it generates significantly more diverse datasets than existing methods. Leveraging this diversity, MoMaGen can train successful imitation learning policies from a single source demonstration, and these policies can be fine-tuned with as few as 40 real-world demonstrations to achieve deployment on physical robotic hardware. More details are available at our project page: momagen.github.io.
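
The constrained-optimization framing lends itself to a simple rejection-sampling sketch: candidate base poses that violate hard constraints (reachability) are discarded, and the survivors are ranked by a soft-constraint score (visibility). The 2D geometry, reach threshold, and heading-based visibility proxy below are toy assumptions, not MoMaGen's actual formulation.

```python
import math, random

ARM_REACH = 0.8                                    # assumed max arm reach (m)

def reachable(base, target):                       # hard constraint: must hold
    return math.dist(base[:2], target[:2]) <= ARM_REACH

def visibility(base, target):                      # soft constraint: higher is better
    bearing = math.atan2(target[1] - base[1], target[0] - base[0])
    err = (bearing - base[2] + math.pi) % (2 * math.pi) - math.pi
    return -abs(err)                               # prefer the base facing the target

def sample_base_pose(target, n=256):
    best, best_score = None, -math.inf
    for _ in range(n):
        base = (random.uniform(-2, 2), random.uniform(-2, 2),
                random.uniform(-math.pi, math.pi))
        if not reachable(base, target):            # hard constraints filter candidates
            continue
        score = visibility(base, target)           # soft constraints rank survivors
        if score > best_score:
            best, best_score = base, score
    return best

print(sample_base_pose((0.5, 0.3)))                # (x, y, theta) or None
```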


HoMeR: Learning In-the-Wild Mobile Manipulation via Hybrid Imitation and Whole-Body Control

Sundaresan, Priya, Malhotra, Rhea, Miao, Phillip, Yang, Jingyun, Wu, Jimmy, Hu, Hengyuan, Antonova, Rika, Engelmann, Francis, Sadigh, Dorsa, Bohg, Jeannette

arXiv.org Artificial Intelligence

We introduce HoMeR, an imitation learning framework for mobile manipulation that combines whole-body control with hybrid action modes that handle both long-range and fine-grained motion, enabling effective performance on realistic in-the-wild tasks. At its core is a fast, kinematics-based whole-body controller that maps desired end-effector poses to coordinated motion across the mobile base and arm. Within this reduced end-effector action space, HoMeR learns to switch between absolute pose predictions for long-range movement and relative pose predictions for fine-grained manipulation, offloading low-level coordination to the controller and focusing learning on task-level decisions. We deploy HoMeR on a holonomic mobile manipulator with a 7-DoF arm in a real home. We compare HoMeR to baselines without hybrid actions or whole-body control across 3 simulated and 3 real household tasks such as opening cabinets, sweeping trash, and rearranging pillows. Across tasks, HoMeR achieves an overall success rate of 79.17% using just 20 demonstrations per task, outperforming the next best baseline by 29.17 percentage points on average. HoMeR is also compatible with vision-language models and can leverage their internet-scale priors to better generalize to novel object appearances, layouts, and cluttered scenes. In summary, HoMeR moves beyond tabletop settings and demonstrates a scalable path toward sample-efficient, generalizable manipulation in everyday indoor spaces. Code, videos, and supplementary material are available at: http://homer-manip.github.io
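
A hedged sketch of the hybrid action interface: the policy emits a mode flag plus an end-effector pose, decoded as either an absolute target (long-range) or a relative delta (fine-grained), which a whole-body controller then turns into coordinated base and arm motion. Every interface below, including the planar base step standing in for the kinematic controller, is an illustrative assumption.

```python
import numpy as np

def decode_action(action, current_ee_pose):
    """action = (mode, x, y, z): mode 0 -> absolute target, mode 1 -> delta."""
    mode, pose = int(action[0] > 0.5), np.asarray(action[1:4])
    return pose if mode == 0 else current_ee_pose + pose

def whole_body_step(target_ee, base_pose, arm_q):
    """Toy stand-in for a kinematic whole-body controller: move the base
    toward the target in the plane and leave fine positioning to the arm."""
    base_step = 0.1 * (target_ee[:2] - base_pose[:2])
    return base_pose + np.append(base_step, 0.0), arm_q  # arm IK omitted

ee = np.zeros(3)
target = decode_action(np.array([1.0, 0.0, 0.05, 0.0]), ee)   # relative nudge
print(whole_body_step(target, np.zeros(3), np.zeros(7)))
```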


Seeing the Bigger Picture: 3D Latent Mapping for Mobile Manipulation Policy Learning

Kim, Sunghwan, Chung, Woojeh, Dai, Zhirui, Bhatt, Dwait, Shukla, Arth, Su, Hao, Tian, Yulun, Atanasov, Nikolay

arXiv.org Artificial Intelligence

In this paper, we demonstrate that mobile manipulation policies utilizing a 3D latent map achieve stronger spatial and temporal reasoning than policies relying solely on images. We introduce Seeing the Bigger Picture (SBP), an end-to-end policy learning approach that operates directly on a 3D map of latent features. In SBP, the map extends perception beyond the robot's current field of view and aggregates observations over long horizons. Our mapping approach incrementally fuses multiview observations into a grid of scene-specific latent features. A pre-trained, scene-agnostic decoder reconstructs target embeddings from these features and enables online optimization of the map features during task execution. A policy, trainable with behavior cloning or reinforcement learning, treats the latent map as a state variable and uses global context from the map obtained via a 3D feature aggregator. We evaluate SBP on scene-level mobile manipulation and sequential tabletop manipulation tasks. Our experiments demonstrate that SBP (i) reasons globally over the scene, (ii) leverages the map as long-horizon memory, and (iii) outperforms image-based policies in both in-distribution and novel scenes, e.g., improving the success rate by 25% for the sequential manipulation task.
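
The map-as-state idea can be sketched as a voxel grid of latent features updated by a running average over multiview observations, with a pooled readout standing in for the 3D feature aggregator. The fusion rule, grid size, and mean-pooling readout below are assumptions for illustration, not SBP's design.

```python
import torch

class LatentMap:
    def __init__(self, size=32, dim=64):
        self.feats = torch.zeros(size, size, size, dim)   # latent feature grid
        self.count = torch.zeros(size, size, size, 1)     # observations per voxel

    def fuse(self, voxel_idx, features):
        """Incrementally accumulate per-point features into their voxels.
        voxel_idx: (N, 3) long indices, features: (N, D)."""
        i, j, k = voxel_idx.unbind(-1)
        self.feats.index_put_((i, j, k), features, accumulate=True)
        self.count.index_put_((i, j, k), torch.ones(len(features), 1), accumulate=True)

    def global_context(self):
        """Toy 3D aggregator: mean over observed voxels -> policy state input."""
        mask = (self.count > 0).squeeze(-1)
        return (self.feats[mask] / self.count[mask]).mean(dim=0)

m = LatentMap()
m.fuse(torch.randint(0, 32, (100, 3)), torch.randn(100, 64))  # one view's features
print(m.global_context().shape)                               # torch.Size([64])
```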


VBM-NET: Visual Base Pose Learning for Mobile Manipulation using Equivariant TransporterNet and GNNs

Naik, Lakshadeep, Fischer, Adam, Duberg, Daniel, Kragic, Danica

arXiv.org Artificial Intelligence

In mobile manipulation, selecting an optimal mobile base pose is essential for successful object grasping. Previous works have addressed this problem either through classical planning methods or by learning state-based policies. Both assume access to reliable state information, such as precise object poses and environment models. In this work, we study base pose planning directly from top-down orthographic projections of the scene, which provide a global overview of the scene while preserving spatial structure. We propose VBM-NET, a learning-based method for base pose selection using such top-down orthographic projections. We use equivariant TransporterNet to exploit spatial symmetries and efficiently learn candidate base poses for grasping. Further, we use graph neural networks to represent a varying number of candidate base poses and use reinforcement learning to determine the optimal base pose among them. We show that VBM-NET produces solutions comparable to those of classical methods in significantly less computation time. Furthermore, we validate sim-to-real transfer by successfully deploying a policy trained in simulation to real-world mobile manipulation.
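
To make the GNN-over-candidates step concrete, here is a tiny message-passing scorer over a variable number of candidate base poses. In the paper the candidates come from an equivariant TransporterNet and the selector is trained with reinforcement learning; both are replaced here by illustrative stubs, and the network layout is an assumption.

```python
import torch
import torch.nn as nn

class PoseScorer(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.enc = nn.Linear(3, dim)               # encode (x, y, theta) per candidate
        self.msg = nn.Linear(dim, dim)             # message function
        self.out = nn.Linear(dim, 1)               # per-node score head

    def forward(self, poses):                      # poses: (N, 3), fully connected graph
        h = torch.relu(self.enc(poses))
        neighbors = self.msg(h).mean(dim=0, keepdim=True)   # one round of messages
        h = h + neighbors                          # aggregate context into each node
        return self.out(h).squeeze(-1)             # one score per candidate

candidates = torch.rand(5, 3)                      # e.g. from a pose proposal network
scores = PoseScorer()(candidates)
print(int(scores.argmax()))                        # index of the selected base pose
```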